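Adapting a generative adversarial network (GAN) aims to transfer a pre-trained GAN to a given domain with limited training data. In this paper, we focus on the one-shot case, which is more challenging and has rarely been explored in previous works. We argue that adaptation from the source domain to the target domain can be decoupled into two parts: the transfer of global style, such as texture and color, and the emergence of new entities that do not belong to the source domain. While previous works mainly focus on style transfer, we propose a novel and concise framework\footnote{\url{https://github.com/thevoidname/generalized-onerized-one-one-shot-gan-adaption}} to address the \textit{generalized one-shot adaptation} task for both style and entity transfer, in which a reference image and its binary entity mask are provided. Our core objective is to constrain the gap between the internal distributions of the reference and the syntheses using the sliced Wasserstein distance. To better achieve this, style fixation is first used to roughly obtain the exemplary style, and an auxiliary network is added to the original generator to disentangle entity and style transfer. In addition, to realize cross-domain correspondence, we propose variational Laplacian regularization to constrain the smoothness of the adapted generator. Both quantitative and qualitative experiments demonstrate the effectiveness of our method in various scenarios.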
Translated by Google Translate
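The sliced Wasserstein distance named in the first abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sliced_wasserstein`, the number of projections, the squared 2-Wasserstein form, and the requirement that both point sets have the same size are all assumptions made for illustration. The idea shown is only the general one: project both sets onto random directions, where 1-D optimal transport reduces to comparing sorted values.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=64, seed=0):
    """Approximate squared sliced 2-Wasserstein distance between two point sets.

    x, y: arrays of shape (n, d) holding the same number of points n
    (an assumption of this sketch, so sorted samples can be paired directly).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions and normalize them onto the unit sphere in R^d.
    dirs = rng.normal(size=(n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project both sets onto every direction: shape (n, n_projections).
    px = x @ dirs.T
    py = y @ dirs.T
    # In 1-D, optimal transport between equal-size samples pairs sorted values.
    px.sort(axis=0)
    py.sort(axis=0)
    return float(np.mean((px - py) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.normal(size=(50, 3))
    print(sliced_wasserstein(a, a))        # identical sets
    print(sliced_wasserstein(a, a + 2.0))  # shifted copy
```

In the setting the abstract describes, the two point sets would presumably be features drawn from the reference image and from a synthesized image; how those features are extracted is not stated in the abstract and is left out here.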
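In this paper, we focus on analyzing and improving the dropout technique for the self-attention layers of Vision Transformers (ViTs), which is important yet surprisingly ignored by prior works. In particular, we study three core questions. First, what should be dropped in self-attention layers? Unlike the literature, which drops attention weights, we propose to move the dropout operation ahead of the attention matrix calculation and to set the Key as the dropout unit, yielding a novel dropout scheme. We theoretically verify that this scheme helps preserve both the regularization and the probabilistic properties of the attention weights, alleviating overfitting to specific patterns and helping the model capture vital information. Second, how should the drop ratio be scheduled across consecutive layers? In contrast to using a constant drop ratio for all layers, we present a new decreasing schedule that gradually lowers the drop ratio along the stack of self-attention layers. We experimentally validate that the proposed schedule avoids overfitting in low-level features and information loss in high-level semantics, thus improving the robustness and stability of model training. Third, is it necessary to perform structured dropout as in CNNs? We try a patch-based, block-wise version of the dropout operation and find that this trick, although useful for CNNs, is not essential for ViTs. Building on the exploration of these three questions, we present the novel DropKey method, which treats the Key as the drop unit and uses a decreasing schedule for the drop ratio, improving ViTs in a general way. Comprehensive experiments demonstrate the effectiveness of DropKey on various ViT architectures, \emph{e.g.}, T2T and VOLO, as well as on various vision tasks, \emph{e.g.}, image classification, object detection, human-object interaction detection, and human body shape recovery. Code will be released upon acceptance.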
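The two ideas in the abstract above, dropping Keys before the softmax and lowering the drop ratio with depth, can be sketched in NumPy as follows. This is a hedged illustration, not the released DropKey code: the function names, the linear form of the schedule, the base ratio of 0.3, the per-(query, key) masking granularity, and the rule that keeps at least one key per query are all assumptions introduced here.

```python
import numpy as np


def drop_ratio(layer, n_layers, base_ratio=0.3):
    """Hypothetical linear decreasing schedule: base_ratio at layer 0, 0 at the last layer."""
    if n_layers <= 1:
        return base_ratio
    return base_ratio * (1.0 - layer / (n_layers - 1))


def attention_with_key_drop(q, k, v, ratio, rng, training=True):
    """Single-head attention where keys are masked in the logits before softmax.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v). Returns (output, attention weights).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)  # (n_q, n_k)
    if training and ratio > 0.0:
        # True means "drop this key for this query".
        drop = rng.random(logits.shape) < ratio
        # Assumption of this sketch: always keep each query's strongest key
        # so that every softmax row has at least one finite logit.
        drop[np.arange(drop.shape[0]), logits.argmax(axis=1)] = False
        logits = np.where(drop, -np.inf, logits)
    # Numerically stable softmax; dropped keys get weight exactly 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
    for layer in range(3):
        r = drop_ratio(layer, n_layers=3)
        _, w = attention_with_key_drop(q, k, v, r, rng)
        print(layer, round(r, 3), w.sum(axis=1))
```

Because the mask is applied to the logits rather than to the softmax output, each row of attention weights still sums to one, which is the probabilistic property the abstract says the scheme preserves.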
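Generating images from a single sample, a newly developing branch of image synthesis, has attracted extensive attention. In this paper, we formulate this problem as sampling from the conditional distribution of a single image, and propose a hierarchical framework that simplifies the learning of this intricate conditional distribution through successive learning of the distributions of structure, semantics, and texture, making both learning and generation comprehensible. On this basis, we design ExSinGAN, composed of three cascaded GANs, to learn an explainable generative model from a given image; the cascaded GANs model the distributions of structure, semantics, and texture in succession. ExSinGAN learns not only from the internal patches of the given image, as previous works did, but also from an external prior obtained with GAN inversion. Benefiting from the appropriate combination of internal and external information, ExSinGAN offers stronger generation capability and competitive generalization ability on image manipulation tasks compared with prior works.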
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications benefit only marginally, or not at all, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) distilling token relations is more effective than CLS token-based and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over MIM pre-training from scratch on ImageNet-1K classification for the ViT-Tiny, ViT-Small, and ViT-Base models, with gains of +4.2%/+2.4%/+1.4%, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, namely by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes from only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After these steps, performance on the novel classes improves significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarked on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient because it uses discrete tokens and requires fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient because it uses parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality, etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io.
Learning the underlying distribution of molecular graphs and generating high-fidelity samples is a fundamental research problem in drug discovery and material science. However, accurately modeling the distribution and rapidly generating novel molecular graphs remain crucial and challenging goals. To accomplish these goals, we propose a novel Conditional Diffusion model based on discrete Graph Structures (CDGS) for molecular graph generation. Specifically, we construct a forward graph diffusion process on both graph structures and inherent features through stochastic differential equations (SDEs) and derive discrete graph structures as the condition for the reverse generative process. We present a specialized hybrid graph noise prediction model that extracts the global context and the local node-edge dependency from intermediate graph states. We further utilize ordinary differential equation (ODE) solvers for efficient graph sampling, based on the semi-linear structure of the probability flow ODE. Experiments on diverse datasets validate the effectiveness of our framework. In particular, the proposed method still generates high-quality molecular graphs in a limited number of steps.
Deep neural networks are vulnerable to adversarial attacks. In this paper, we take the role of investigators who want to trace the attack and identify the source, that is, the particular model from which the adversarial examples are generated. The techniques derived would aid forensic investigation of attack incidents and serve as a deterrent to potential attacks. We consider the buyers-seller setting, where a machine learning model is to be distributed to various buyers and each buyer receives a slightly different copy with the same functionality. A malicious buyer generates adversarial examples from a particular copy $\mathcal{M}_i$ and uses them to attack other copies. From these adversarial examples, the investigator wants to identify the source $\mathcal{M}_i$. To address this problem, we propose a two-stage separate-and-trace framework. The model separation stage generates multiple copies of a model for the same classification task. This process injects unique characteristics into each copy so that the adversarial examples generated have distinct and traceable features. We give a parallel structure that embeds a ``tracer'' in each copy, and a noise-sensitive training loss to achieve this goal. The tracing stage takes in adversarial examples and a few candidate models, and identifies the likely source. Based on the unique features induced by the noise-sensitive loss function, we can effectively trace the potential adversarial copy by considering the output logits from each tracer. Empirical results show that it is possible to trace the origin of an adversarial example and that the mechanism can be applied to a wide range of architectures and datasets.
This paper presents a novel framework for planning in unknown and occluded urban spaces. We specifically focus on turns and intersections where occlusions significantly impact navigability. Our approach uses an inpainting model to fill in a sparse, occluded, semantic lidar point cloud and plans dynamically feasible paths for a vehicle to traverse through the open and inpainted spaces. We demonstrate our approach using a car's lidar data with real-time occlusions, and show that by inpainting occluded areas, we can plan longer paths with more turn options than without inpainting; in addition, our approach follows paths derived from a planner with no occlusions (called the ground truth) more closely than other state-of-the-art approaches.
Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to its paired video clips in a common feature space. For long videos, given a paragraph of description whose sentences describe different segments of the video, matching all sentence-clip pairs aligns the paragraph and the full video implicitly. However, such a unit-level similarity measure may ignore the global temporal context over a long time span, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework, TempCLR, to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal order by shuffling the video clips or sentences according to the temporal granularity. In this way, we obtain representations for clips/sentences that perceive the temporal information and thus facilitate the sequence alignment. In addition to pre-training on videos and paragraphs, our approach also generalizes to matching between different video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gains on all three tasks. Detailed ablation studies are provided to justify the design of the approach.
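The sequence-level distance described above, a minimum cumulative cost over sentence-clip pairs under a temporal-order constraint, is what classic dynamic time warping computes. The sketch below is a minimal NumPy version under stated assumptions: the cosine-based cost, the standard three-way step pattern, and the function names `cosine_cost` and `dtw_distance` are illustrative choices, not details taken from TempCLR. It also shows why a shuffled sequence makes a natural negative: breaking the temporal order raises the alignment cost.

```python
import numpy as np


def cosine_cost(sentences, clips):
    """Pairwise cost 1 - cosine similarity; rows are sentences, columns are clips."""
    s = sentences / np.linalg.norm(sentences, axis=1, keepdims=True)
    c = clips / np.linalg.norm(clips, axis=1, keepdims=True)
    return 1.0 - s @ c.T


def dtw_distance(cost):
    """Minimum cumulative cost of a monotonic alignment through an (n, m) cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three order-preserving predecessors.
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return float(acc[n, m])


if __name__ == "__main__":
    # Toy embeddings: four sentences and four clips that match one-to-one in order.
    sentences = np.eye(4)
    clips = np.eye(4)
    shuffled = clips[[2, 0, 3, 1]]  # same clips, temporal order broken
    print(dtw_distance(cosine_cost(sentences, clips)))
    print(dtw_distance(cosine_cost(sentences, shuffled)))
```

In a contrastive setup of the kind the abstract outlines, the ordered pair would serve as the positive and the shuffled one as a negative; the loss that consumes these distances is not specified in the abstract and is omitted here.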